Abstract

We consider the problem of learning unions of rectangles over the domain $[b]^n$, in the
uniform distribution membership query learning setting, where both $b$ and $n$ are "large".
We obtain $\mathrm{poly}(n, \log b)$-time algorithms for the following classes:

• $\mathrm{poly}(n \log b)$-way Majority of $O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$-dimensional rectangles.

• Union of $\mathrm{poly}(\log(n \log b))$ many $O\left(\frac{\log^2(n \log b)}{(\log\log(n \log b)\,\log\log\log(n \log b))^2}\right)$-dimensional rectangles.

• $\mathrm{poly}(n \log b)$-way Majority of $\mathrm{poly}(n \log b)$-Or of disjoint $O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$-dimensional rectangles.

Our main algorithmic tool is an extension of Jackson's boosting- and Fourier-based Harmonic
Sieve algorithm [12] to the domain $[b]^n$, building on work of Akavia et al. [1]. Other
ingredients used to obtain the results stated above are techniques from exact learning [3]
and ideas from recent work on learning augmented $AC^0$ circuits [13] and on representing
Boolean functions as thresholds of parities [15].
1 Introduction



1.1 Motivation

The learnability of Boolean valued functions defined over the domain
$[b]^n = \{0, 1, \ldots, b-1\}^n$ has long elicited interest in the computational learning
theory literature. In particular, much research has been done on learning various classes of
"unions of rectangles" over $[b]^n$ (see e.g. [3,5,6,9,12,18]), where a rectangle is a
conjunction of properties of the form "the value of attribute $x_i$ lies in the range
$[\alpha_i, \beta_i]$". One motivation for studying these classes is that they are a natural
analogue of classes of DNF (Disjunctive Normal Form) formulae over $\{0, 1\}^n$; for instance,
it is easy to see that in the case $b = 2$ any union of $s$ rectangles is simply a DNF with
$s$ terms.

Since the description length of a point $x \in [b]^n$ is $n \log b$ bits, a natural goal in
learning functions over $[b]^n$ is to obtain algorithms which run in time $\mathrm{poly}(n \log b)$.
Throughout the article we refer to such algorithms with $\mathrm{poly}(n \log b)$ runtime as
efficient algorithms. In this article we give efficient algorithms which can learn
several interesting classes of unions of rectangles over $[b]^n$ in the model of uniform
distribution learning with membership queries.

1.2 Previous results 

In a breakthrough result a decade ago, Jackson [12] gave the Harmonic Sieve (HS)
algorithm and proved that it can learn any s-term DNF formula over n Boolean 
variables in poly(n, s) time. In fact, Jackson showed that the algorithm can learn 
any s-way majority of parities in poly(n, s) time; this is a richer set of functions 
which includes all s-term DNF formulae. The HS algorithm works by boosting a 
Fourier-based weak learning algorithm, which is a modified version of an earlier 
algorithm due to Kushilevitz and Mansour [TTJ. 

In [12] Jackson also described an extension of the HS algorithm to the domain $[b]^n$.
His main result for $[b]^n$ is an algorithm that can learn any union of $s$ rectangles over
$[b]^n$ in $\mathrm{poly}(s^{b \log\log s}, n)$ time; note that this runtime is $\mathrm{poly}(n, s)$
if and only if $b$ is $O(1)$ (and the runtime is clearly exponential in $b$ for any $s$).

There has also been substantial work on learning various classes of unions of rectangles
over $[b]^n$ in the more demanding model of exact learning from membership
and equivalence queries. Some of the subclasses of unions of rectangles which have
been considered in this setting are:

The dimension of each rectangle is $O(1)$: Beimel and Kushilevitz established an
algorithm learning any union of $s$ $O(1)$-dimensional rectangles over $[b]^n$ using
equivalence queries only, in $\mathrm{poly}(n, s, \log b)$ time steps [3].

The number of rectangles is limited: In [3] an algorithm is also given which exactly
learns any union of $O(\log n)$ many rectangles in $\mathrm{poly}(n, \log b)$ time using
membership and equivalence queries. Earlier, Maass and Warmuth [18] gave an
algorithm which uses only equivalence queries and can learn any union of $O(1)$
rectangles in $\mathrm{poly}(n, \log b)$ time.

The rectangles are disjoint: If no input $x \in [b]^n$ belongs to more than one rectangle,
then [3] can learn a union of $s$ such rectangles in $\mathrm{poly}(n, s, \log b)$ time with
membership and equivalence queries.

1.3 Our techniques and results 

Because efficient learnability is established for unions of $O(\log n)$ arbitrary dimensional
rectangles by [3] in a more demanding model, we are interested in achieving
positive results when the number of rectangles is strictly larger. Therefore all
the cases we study involve at least $\mathrm{poly}(\log(n \log b))$ and sometimes as many as
$\mathrm{poly}(n \log b)$ rectangles.

We start by describing a new variant of the Harmonic Sieve algorithm for learning
functions defined over $[b]^n$; we call this new algorithm the Generalized Harmonic
Sieve, or GHS. The key difference between GHS and Jackson's algorithm for $[b]^n$
is that whereas Jackson's algorithm used a weak learning algorithm whose runtime
is $\mathrm{poly}(b)$, the GHS algorithm uses a $\mathrm{poly}(\log b)$ time weak learning algorithm
described in recent work of Akavia et al. [1].

We then apply GHS to learn various classes of functions defined in terms of
"$b$-literals" (see Section 2 for a precise definition; roughly speaking a $b$-literal is like a
1-dimensional rectangle). We first show the following result:

Theorem 1.1 The concept class of $s$-way Majority of $r$-way Parity of $b$-literals
where $s = \mathrm{poly}(n \log b)$, $r = O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$ is efficiently learnable using GHS.

Learning this class has immediate applications for our goal of "learning unions of 
rectangles"; in particular, it follows that 

Theorem 1.2 The concept class of $s$-way Majority of $r$-dimensional rectangles
where $s = \mathrm{poly}(n \log b)$, $r = O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$ is efficiently learnable using GHS.

This clearly implies efficient learnability for unions (as opposed to majorities) of s 
such rectangles as well. 

We then employ a technique of restricting the domain $[b]^n$ to a much smaller set
and adaptively expanding this set as required. This approach was used in the exact
learning framework by Beimel and Kushilevitz [3]; by an appropriate modification
we adapt the underlying idea to the uniform distribution membership query framework.
Using this approach in conjunction with GHS we obtain almost a quadratic
improvement in the dimension of the rectangles if the number of terms is guaranteed
to be small:

Theorem 1.3 The concept class of unions of $\mathrm{poly}(\log(n \log b))$ many $r$-dimensional
rectangles where $r = O\left(\frac{\log^2(n \log b)}{(\log\log(n \log b)\,\log\log\log(n \log b))^2}\right)$ is efficiently learnable via
Algorithm 2 (see Section 5).

Finally we consider the case of disjoint rectangles (also studied by [3] as mentioned
above), and improve the depth of our circuits by 1 provided that the rectangles
connected to the same Or gate are disjoint:

Corollary 1.4 The concept class of $s$-way Majority of $t$-way Or of disjoint
$r$-dimensional rectangles where $s, t = \mathrm{poly}(n \log b)$, $r = O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$ is
efficiently learnable under GHS.

1.4 Organization 

In Section 3 we describe the Generalized Harmonic Sieve algorithm GHS which 
will be our main tool for learning unions of rectangles. In Section 4 we show that 
$s$-way Majority of $r$-way Parity of $b$-literals is efficiently learnable using GHS
for suitable r, s; this concept class turns out to be quite useful for learning unions 
of rectangles. In Section 5 we improve over the results of Section 4 slightly if 
the number of terms is small, by adaptively selecting a small subset of [b] in each 
dimension which is sufficient for learning, and invoke GHS over the restricted do- 
main. In Section 6 we explore the consequences of the results in Sections 4 and 5 
for the ultimate goal of learning unions of rectangles. 



2 Preliminaries 

2.1 The learning model 

We are interested in Boolean functions defined over the domain $[b]^n$, where $[b] =
\{0, 1, \ldots, b-1\}$. We view Boolean functions as mappings into $\{-1, 1\}$ where $-1$
is associated with True and $1$ with False.

A concept class $\mathfrak{C}$ is a collection of classes (sets) of Boolean functions
$\{C_{n,b} : n > 0, b > 1\}$ such that if $f \in C_{n,b}$ then $f : [b]^n \to \{-1, 1\}$. As a simple example,
consider the case where $b = 2$ and $\mathfrak{C}$ is the class of all monotone Boolean conjunctions;
then for each $n$ we have that $C_{n,b}$ is the set of all Boolean conjunctions over
a subset of the Boolean input variables $x_1, \ldots, x_n$. Throughout this article we view
both $n$ and $b$ as asymptotic parameters, and our goal, as mentioned in Section 1.1, is
to construct algorithms that learn various classes $C_{n,b}$ in $\mathrm{poly}(n, \log b)$ time. (Note
that given this goal, it only makes sense to attempt to learn concept classes such
that each concept in the class has "description length" at most $\mathrm{poly}(n \log b)$ bits. It
will be clear that this is the case for all the concept classes we consider.) We now
describe the uniform distribution membership query learning model that we will
consider.

A membership oracle MEM($f$) is an oracle which, when queried with input $x$,
outputs the label $f(x)$ assigned by the target $f$ to the input. Let $f \in C_{n,b}$ be an
unknown member of the concept class and let $A$ be a randomized learning algorithm
which takes as input accuracy and confidence parameters $\epsilon, \delta$ and can invoke
MEM($f$). We say that $A$ learns $\mathfrak{C}$ under the uniform distribution on $[b]^n$ provided
that given any $0 < \epsilon, \delta < 1$ and access to MEM($f$), with probability at least $1 - \delta$, $A$
outputs an $\epsilon$-approximating hypothesis $h : [b]^n \to \{-1, 1\}$ (which need not belong
to $\mathfrak{C}$) such that $\Pr_{x \in [b]^n}[f(x) = h(x)] \ge 1 - \epsilon$.

We are interested in computationally efficient learning algorithms. We say that $A$
learns $\mathfrak{C}$ efficiently if for any target concept $f \in C_{n,b}$,

• $A$ runs for at most $\mathrm{poly}(n, \log b, 1/\epsilon, \log 1/\delta)$ steps;

• Any hypothesis $h$ that $A$ produces can be evaluated at any $x \in [b]^n$ in at most
$\mathrm{poly}(n, \log b, 1/\epsilon, \log 1/\delta)$ time steps.

2.2 The functions we study 

The reader might wonder which classes of Boolean valued functions over $[b]^n$ are
interesting. In this article we study classes of functions that are defined in terms
of "$b$-literals"; these include rectangles and unions of rectangles over $[b]^n$ as well
as other richer classes. As described below, $b$-literals are a natural extension of
Boolean literals to the domain $[b]^n$.

Definition 2.1 A function $\ell : [b] \to \{-1, 1\}$ is a basic $b$-literal if for some
$\omega \in \{-1, 1\}$ and some $\alpha \le \beta$ with $\alpha, \beta \in [b]$ we have $\ell(x) = \omega$ if $\alpha \le x \le \beta$, and
$\ell(x) = -\omega$ otherwise. A function $\ell : [b] \to \{-1, 1\}$ is a $b$-literal if there exists a
basic $b$-literal $\ell'$ and some fixed $z \in [b]$, $\gcd(z, b) = 1$, such that for all $x \in [b]$ we
have $\ell(x) = \ell'(xz \bmod b)$.

Basic $b$-literals are the most natural extension of Boolean literals to the domain $[b]^n$.
General $b$-literals (not necessarily basic) were previously studied in [1] and are also
quite natural.
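
As a concrete illustration of Definition 2.1, the following minimal Python sketch evaluates basic $b$-literals and general $b$-literals. The function names, the argument order, and the example parameters are illustrative assumptions and do not come from the article.

```python
from math import gcd

def basic_literal(alpha, beta, omega, b):
    """Basic b-literal: omega on the interval [alpha, beta], -omega elsewhere."""
    assert 0 <= alpha <= beta < b and omega in (-1, 1)
    return lambda x: omega if alpha <= x <= beta else -omega

def literal(alpha, beta, omega, z, b):
    """General b-literal x -> l'(x * z mod b), with l' basic and gcd(z, b) = 1."""
    assert gcd(z, b) == 1
    base = basic_literal(alpha, beta, omega, b)
    return lambda x: base((x * z) % b)

# Hypothetical example over [10]: the basic literal is -1 (True) exactly on {2, 3, 4}.
ell = basic_literal(2, 4, -1, 10)
# Composing with multiplication by z = 3 scatters the accepted set over [10].
ell3 = literal(2, 4, -1, 3, 10)
accepted = sorted(x for x in range(10) if ell3(x) == -1)
```

The accepted set of a general $b$-literal need not be an interval, which is why such literals are strictly richer than 1-dimensional rectangles.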
Example 2.2 If $b$ is odd then the least significant bit function $\mathrm{lsb}(x) : [b] \to \{-1, 1\}$,
defined by $\mathrm{lsb}(x) = -1$ iff $x$ is even, is a $b$-literal.

To see this, let $z = 2^{-1} \bmod b$ (this value exists since $b$ is odd). Let $E =
\{0, 2, 4, \ldots, b-1\}$ denote the set of all the even residues in $[b]$, i.e. $E$ is precisely
the set of inputs that are mapped to $-1$ under $\mathrm{lsb}$. We have

$E = \{0 \cdot 2,\ 1 \cdot 2,\ \ldots,\ \tfrac{b-1}{2} \cdot 2\}$

and consequently

$E \cdot z \bmod b = \{0 \cdot 2 \cdot 2^{-1} \bmod b,\ 1 \cdot 2 \cdot 2^{-1} \bmod b,\ \ldots,\ \tfrac{b-1}{2} \cdot 2 \cdot 2^{-1} \bmod b\} = \{0, 1, 2, \ldots, \tfrac{b-1}{2}\}.$

The function $\ell'(x)$ which equals $-1$ iff $x \in \{0, 1, \ldots, \tfrac{b-1}{2}\}$ is a basic $b$-literal, and
consequently $\mathrm{lsb}(x) = \ell'(xz \bmod b)$ is a $b$-literal.
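
The argument of Example 2.2 can be checked mechanically for small odd $b$. The sketch below is only an illustration; the helper names are assumptions, and the three-argument `pow` with exponent `-1` requires Python 3.8 or later.

```python
def lsb(x):
    """Least significant bit function: -1 iff x is even."""
    return -1 if x % 2 == 0 else 1

def lsb_as_literal(b):
    """Build lsb over [b] (b odd) as l'(x * z mod b) with z = 2^{-1} mod b."""
    assert b % 2 == 1
    z = pow(2, -1, b)                      # modular inverse of 2, exists since b is odd
    half = (b - 1) // 2
    basic = lambda y: -1 if 0 <= y <= half else 1   # basic b-literal on [0, (b-1)/2]
    return lambda x: basic((x * z) % b)

# Compare the two definitions on every point of [b] for a few odd values of b.
agreement = {b: all(lsb(x) == lsb_as_literal(b)(x) for x in range(b))
             for b in (3, 5, 7, 9, 15, 101)}
```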

Definition 2.3 A function $f : [b]^n \to \{-1, 1\}$ is a $k$-dimensional rectangle if it is
an And of $k$ basic $b$-literals $\ell_1, \ldots, \ell_k$ over $k$ distinct variables $x_{i_1}, \ldots, x_{i_k}$. If $f$ is
a $k$-dimensional rectangle for some $k$ then we may simply say that $f$ is a rectangle.
A union of $s$ rectangles $R_1, \ldots, R_s$ is a function of the form $f(x) = \mathrm{Or}_{i=1}^{s} R_i(x)$.

The class of unions of $s$ rectangles over $[b]^n$ is a natural generalization of the class
of $s$-term DNF over $\{0, 1\}^n$. Similarly Majority of Parity of basic $b$-literals
generalizes the class of Majority of Parity of Boolean literals, a class which
has been the subject of much research (see e.g. [12,4,15]).
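
The following Python sketch illustrates Definition 2.3 under the $-1 = $ True convention. For simplicity it assumes every literal is True inside its interval (the definition also allows literals that are True outside it), and all names are illustrative rather than taken from the article. The last lines check the remark that for $b = 2$ a union of rectangles is a DNF.

```python
from itertools import product

def rectangle(ranges):
    """ranges maps a variable index i to (lo, hi); -1 (True) iff every x[i] is in [lo, hi]."""
    return lambda x: -1 if all(lo <= x[i] <= hi for i, (lo, hi) in ranges.items()) else 1

def union(rectangles):
    """Or of rectangles under the -1 = True convention."""
    return lambda x: -1 if any(r(x) == -1 for r in rectangles) else 1

# For b = 2 a union of rectangles is a DNF: (x0 AND NOT x2) OR x1.
f = union([rectangle({0: (1, 1), 2: (0, 0)}), rectangle({1: (1, 1)})])
dnf = lambda x: (x[0] == 1 and x[2] == 0) or x[1] == 1
agree = all((f(x) == -1) == dnf(x) for x in product(range(2), repeat=3))

# A 2-dimensional rectangle over [8]^3 that ignores the third variable.
box = rectangle({0: (2, 5), 1: (3, 7)})
```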

If $G$ is a logic gate with potentially unbounded fan-in (e.g. Majority, Parity,
And, etc.) we write "$s$-way $G$" to indicate that the fan-in of $G$ is restricted to be at
most $s$. Thus, for example, an "$s$-way Majority of $r$-way Parity of $b$-literals"
is a Majority of at most $s$ functions $g_1, \ldots, g_s$, each of which is a Parity of
at most $r$ many $b$-literals. We will further assume that any two $b$-literals which are
inputs to the same gate depend on different variables. This is a natural restriction
to impose in light of our ultimate goal of learning unions of rectangles. Although
our results hold without this assumption, it simplifies the presentation.

2.3 Harmonic analysis of functions over $[b]^n$

We will make use of the Fourier expansion of complex valued functions over $[b]^n$.

Consider $f, g : [b]^n \to \mathbb{C}$ endowed with the inner product $\langle f, g \rangle = \mathrm{E}[f\overline{g}]$ and
induced norm $\|f\| = \sqrt{\langle f, f \rangle}$. Let $\omega_b = e^{2\pi i/b}$ and for each $\alpha = (\alpha_1, \ldots, \alpha_n) \in [b]^n$,
let $\chi_\alpha : [b]^n \to \mathbb{C}$ be defined as

$\chi_\alpha(x) = \omega_b^{\alpha_1 x_1 + \cdots + \alpha_n x_n}.$

Let $B$ denote the set of functions $B = \{\chi_\alpha : \alpha \in [b]^n\}$. It is easy to verify the
following properties:

• Elements in $B$ are normal: for each $\alpha = (\alpha_1, \ldots, \alpha_n) \in [b]^n$, we have $\|\chi_\alpha\| = 1$.

• Elements in $B$ are orthogonal: for $\alpha, \beta \in [b]^n$, we have $\langle \chi_\alpha, \chi_\beta \rangle = 1$ if $\alpha = \beta$,
and $\langle \chi_\alpha, \chi_\beta \rangle = 0$ if $\alpha \ne \beta$.

• $B$ constitutes an orthonormal basis for all functions $\{f : [b]^n \to \mathbb{C}\}$ considered
as a vector space over $\mathbb{C}$. Thus every $f : [b]^n \to \mathbb{C}$ can be expressed uniquely as

$f(x) = \sum_{\alpha} \hat{f}(\alpha)\chi_\alpha(x),$

which we refer to as the Fourier expansion or Fourier transform of $f$.

The values $\{\hat{f}(\alpha) : \alpha \in [b]^n\}$ are called the Fourier coefficients or the Fourier spectrum
of $f$. As is well known, Parseval's Identity relates the values of the coefficients
to the values of the function:

Lemma 2.4 (Parseval's Identity) $\sum_\alpha |\hat{f}(\alpha)|^2 = \mathrm{E}[|f|^2]$ for any $f : [b]^n \to \mathbb{C}$.
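
The Fourier expansion and Parseval's Identity can be verified by brute force on a tiny domain. The sketch below assumes the standard definitions $\chi_\alpha(x) = \omega_b^{\alpha_1 x_1 + \cdots + \alpha_n x_n}$ and $\hat{f}(\alpha) = \mathrm{E}[f\overline{\chi_\alpha}]$; it enumerates all of $[b]^n$, so it is an illustration only and not an efficient procedure. The example function and all names are assumptions.

```python
import cmath
from itertools import product

def chi(alpha, x, b):
    """Character chi_alpha(x) = omega_b^(alpha_1 x_1 + ... + alpha_n x_n)."""
    return cmath.exp(2j * cmath.pi * sum(a * y for a, y in zip(alpha, x)) / b)

def fourier_coefficient(f, alpha, b, n):
    """hat f(alpha) = E_x[f(x) * conj(chi_alpha(x))] under the uniform distribution."""
    points = list(product(range(b), repeat=n))
    return sum(f(x) * chi(alpha, x, b).conjugate() for x in points) / len(points)

b, n = 3, 2
f = lambda x: -1 if x[0] <= 1 and x[1] >= 1 else 1      # a small rectangle over [3]^2
coeffs = {a: fourier_coefficient(f, a, b, n) for a in product(range(b), repeat=n)}

# Parseval: the squared coefficient magnitudes sum to E[|f|^2], which is 1 for +-1-valued f.
parseval_lhs = sum(abs(c) ** 2 for c in coeffs.values())

# Fourier expansion: f(x) is recovered as the sum of hat f(alpha) * chi_alpha(x).
reconstruct = lambda x: sum(c * chi(a, x, b) for a, c in coeffs.items())
```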

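We write $L_1(f)$ to denote $\sum_\alpha |\hat{f}(\alpha)|$ and $L_\infty(f)$ to denote $\max_x |f(x)|$.
We will also make use of the following simple fact:

Observation 2.5 For any $f, h : [b]^n \to \mathbb{C}$ and any distribution $\mathcal{D}$ over $[b]^n$,

$|\mathrm{E}_{\mathcal{D}}[f\overline{h}]| = \left|\mathrm{E}_{\mathcal{D}}\left[f \sum_\alpha \overline{\hat{h}(\alpha)\chi_\alpha}\right]\right| = \left|\sum_\alpha \overline{\hat{h}(\alpha)}\,\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]\right| \le L_1(h) \max_\alpha |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]|.$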

2.4 Additional tools: weak hypotheses and boosting 

Definition 2.6 Let $f : [b]^n \to \{-1, 1\}$ and $\mathcal{D}$ be a probability distribution over $[b]^n$.
A function $g : [b]^n \to [-1, 1]$ is said to be a weak hypothesis for $f$ with advantage
$\gamma$ under $\mathcal{D}$ if $\mathrm{E}_{\mathcal{D}}[fg] \ge \gamma$.

The first boosting algorithm was described by Schapire [20] in 1990; since then
boosting has been intensively studied (see [8] for an overview). The basic idea is
that by combining a sequence of weak hypotheses $h_1, h_2, \ldots$ (the $i$-th of which
has advantage $\gamma$ with respect to a carefully chosen distribution $\mathcal{D}_i$) it is possible
to obtain a high accuracy final hypothesis $h$ which satisfies $\Pr[h(x) = f(x)] \ge
1 - \epsilon$. The following theorem, which can be obtained easily from the results of [21,
Section 2.3], gives a precise statement of the performance guarantees of a particular
boosting algorithm, which we call Algorithm B. Many similar statements are now
known about a range of different boosting algorithms but this is sufficient for our
purposes.

Theorem 2.7 (Boosting Algorithm [21]) Suppose that Algorithm B is given:

• $0 < \epsilon, \delta < 1$, and membership query access MEM($f$) to $f : [b]^n \to \{-1, 1\}$;

• access to an algorithm WL which has the following property: given a value $\delta'$
and access to MEM($f$) and to EX($f, \mathcal{D}$) (the latter is an example oracle which
generates random examples from $[b]^n$ drawn with respect to distribution $\mathcal{D}$), it
constructs a weak hypothesis for $f$ with advantage $\gamma$ under $\mathcal{D}$ with probability
at least $1 - \delta'$ in time polynomial in $n$, $\log b$, $\log(1/\delta')$.

Then Algorithm B behaves as follows:

• It runs for $S = O(1/\epsilon\gamma^2)$ stages and runs in total time polynomial in $n$, $\log b$,
$\epsilon^{-1}$, $\gamma^{-1}$, $\log(\delta^{-1})$.

• At each stage $1 \le j \le S$ it constructs a distribution $\mathcal{D}_j$ such that $L_\infty(\mathcal{D}_j) \le
\mathrm{poly}(\epsilon^{-1})/b^n$, and simulates EX($f, \mathcal{D}_j$) for WL in stage $j$. Moreover, there is a
value $c \in [1/2, 3/2]$ (the precise value of $c$ depends on $\mathcal{D}_j$ and is not known to
the algorithm) and a fixed "pseudo-distribution" $\tilde{\mathcal{D}}_j$ satisfying $\tilde{\mathcal{D}}_j(x) = c\,\mathcal{D}_j(x)$
for all $x$, such that $\tilde{\mathcal{D}}_j(x)$ can be computed in time polynomial in $n \log b$ for each
$x \in [b]^n$.

• It outputs a final hypothesis $h = \mathrm{sign}(h_1 + h_2 + \cdots + h_S)$ which $\epsilon$-approximates
$f$ under the uniform distribution with probability $1 - \delta$; here $h_j$ is the output of
WL at stage $j$ invoked with simulated access to EX($f, \mathcal{D}_j$).

We will sometimes informally refer to distributions $\mathcal{D}$ which satisfy the bound
$L_\infty(\mathcal{D}) \le \mathrm{poly}(\epsilon^{-1})/b^n$ as smooth distributions.

In order to use boosting, it must be the case that there exists a suitable weak
hypothesis with advantage $\gamma$. In this paper we will use the "discriminator lemma" of
Hajnal et al. [10] (see also [19]) at various points (see e.g. the proofs of Theorem 4.5
and Lemma 4.8) to assert that the desired weak hypothesis exists:

Lemma 2.8 (The Discriminator Lemma [10,19]) Let $\mathfrak{H}$ be a class of $\pm 1$-valued
functions over $[b]^n$ and let $f : [b]^n \to \{-1, 1\}$ be expressible as

$f = \mathrm{Majority}(h_1, \ldots, h_s)$

where each $h_i \in \mathfrak{H}$ and $h_1(x) + \cdots + h_s(x) \ne 0$ for all $x$. Then for any distribution
$\mathcal{D}$ over $[b]^n$ there is some $h_i$ such that $|\mathrm{E}_{\mathcal{D}}[f h_i]| \ge 1/s$.
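
The Discriminator Lemma can be sanity-checked numerically on a small domain. The sketch below builds a 3-way Majority of $\pm 1$-valued functions over $[4]^2$, draws an arbitrary distribution, and computes $\max_i |\mathrm{E}_{\mathcal{D}}[f h_i]|$ exactly. The particular functions, the seed, and all names are illustrative assumptions.

```python
import random
from itertools import product

def majority(hs):
    """Sign of h_1(x) + ... + h_s(x); assumes the sum is never zero (e.g. s odd)."""
    return lambda x: 1 if sum(h(x) for h in hs) > 0 else -1

def best_correlation(f, hs, dist):
    """max_i |E_D[f h_i]| for a distribution given as a dict point -> probability."""
    return max(abs(sum(p * f(x) * h(x) for x, p in dist.items())) for h in hs)

random.seed(0)
b, n = 4, 2
points = list(product(range(b), repeat=n))
hs = [
    lambda x: -1 if 1 <= x[0] <= 2 else 1,     # a basic b-literal on the first variable
    lambda x: -1 if x[1] <= 1 else 1,          # a basic b-literal on the second variable
    lambda x: (-1) ** (x[0] + x[1]),           # some other +-1-valued function
]
f = majority(hs)
weights = [random.random() for _ in points]
dist = {x: w / sum(weights) for x, w in zip(points, weights)}
lemma_bound = 1 / len(hs)
```

The bound holds for every distribution because $f(x)\sum_i h_i(x) \ge 1$ pointwise, so the average of the $s$ correlations $\mathrm{E}_{\mathcal{D}}[f h_i]$ is at least $1/s$.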
3 The Generalized Harmonic Sieve Algorithm



In this section our goal is to describe a variant of Jackson's Harmonic Sieve Algorithm
and show that under suitable conditions it can efficiently learn certain functions
$f : [b]^n \to \{-1, 1\}$. As mentioned earlier, our aim is to attain $\mathrm{poly}(\log b)$
runtime dependence on $b$ and consequently obtain efficient algorithms as described
in Section 2.1. This goal precludes using Jackson's original Harmonic Sieve variant
for $[b]^n$ since the runtime of his weak learner depends polynomially rather than
polylogarithmically on $b$ (see [12, Lemma 15]).

As we describe below, this $\mathrm{poly}(\log b)$ runtime can be achieved by modifying the
Harmonic Sieve over $[b]^n$ to use a weak learner due to Akavia et al. [1] which is
more efficient than Jackson's weak learner. We shall call the resulting algorithm
"The Generalized Harmonic Sieve" algorithm, or GHS for short.

Recall that in the Harmonic Sieve over the Boolean domain $\{-1, 1\}^n$, the weak
hypotheses used are simply the Fourier basis elements over $\{-1, 1\}^n$, which correspond
to the Boolean-valued parity functions. For $[b]^n$, we will use the real component
of the complex-valued Fourier basis elements $\{\chi_\alpha, \alpha \in [b]^n\}$ (as defined in
Section 2.3) as our weak hypotheses.

The following theorem of Akavia et al. [1, Theorem 5] will play a crucial role
towards construction of the GHS algorithm.

Theorem 3.1 (See [1]) There is a learning algorithm that, given membership query
access to $f : [b]^n \to \mathbb{C}$, $0 < \gamma$ and $0 < \delta < 1$, outputs a list $L$ of indices such that
with probability at least $1 - \delta$, we have $\{\alpha : |\hat{f}(\alpha)| > \gamma\} \subseteq L$ and $|\hat{f}(\beta)| \ge \gamma/2$ for
every $\beta \in L$. The running time of the algorithm is polynomial in $n$, $\log b$, $\|f\|_\infty$,
$\gamma^{-1}$, $\log(\delta^{-1})$.

Lemma 3.2 (Construction of the weak hypothesis) Given

• Membership query access MEM($f$) to $f : [b]^n \to \{-1, 1\}$;

• A smooth distribution $\mathcal{D}$; more precisely, access to an algorithm computing
$\tilde{\mathcal{D}}(x)$ in time polynomial in $n$, $\log b$ for each $x \in [b]^n$. Here $\tilde{\mathcal{D}}$ is a
"pseudo-distribution" for $\mathcal{D}$ as in Theorem 2.7, i.e. there is a value $c \in [1/2, 3/2]$ such
that $\tilde{\mathcal{D}}(x) = c\,\mathcal{D}(x)$ for all $x$.

• A value $0 < \gamma < 1/2$ such that there exists an element of the Fourier basis $\chi_\tau$
satisfying $|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\tau}]| \ge \gamma$,

there is an algorithm that outputs a weak hypothesis for $f$ with advantage $\gamma/4$
under $\mathcal{D}$ with probability $1 - \delta$ and runs in time polynomial in $n$, $\log b$, $\epsilon^{-1}$, $\gamma^{-1}$,
$\log(\delta^{-1})$.
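PROOF. Let $f^*(x) = b^n \tilde{\mathcal{D}}(x) f(x)$. Observe that

• Since $\mathcal{D}$ is smooth, $\|f^*\|_\infty \le \mathrm{poly}(\epsilon^{-1})$.

• For any $\alpha \in [b]^n$, $\widehat{f^*}(\alpha) = \mathrm{E}[f^*\overline{\chi_\alpha}] = \frac{1}{b^n}\sum_{x \in [b]^n} b^n \tilde{\mathcal{D}}(x) f(x)\overline{\chi_\alpha(x)} = \mathrm{E}_{\mathcal{D}}[c f \overline{\chi_\alpha}]$.

Therefore one can invoke the algorithm of Theorem 3.1 over $f^*(x)$ by simulating
MEM($f^*$) via MEM($f$), each time with $\mathrm{poly}(n, \log b)$ time overhead, and obtain
a list $L$ of indices. Note that since we are guaranteed that there exists an index $\tau$
satisfying $|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\tau}]| \ge \gamma$, implying $|\widehat{f^*}(\tau)| \ge c\gamma$, we can invoke Theorem 3.1 in
such a way that for any index $\beta$ in its output, we know $|\widehat{f^*}(\beta)| \ge c\gamma/2$.

It is easy to see that the algorithm runs in the desired time bound and outputs a
nonempty list $L$. Let $\beta$ be any element of $L$. Since $\widehat{f^*}(\beta) = \mathrm{E}[b^n \tilde{\mathcal{D}}(x) f(x)\overline{\chi_\beta(x)}]$,
one can approximate $\widehat{f^*}(\beta)/|\widehat{f^*}(\beta)| = e^{i\theta}$ using uniformly drawn random examples.
Let $e^{i\theta'}$ be the approximation thus obtained.

By assumption we know that for random $x \in [b]^n$, the random variable

$b^n \tilde{\mathcal{D}}(x) f(x) \overline{\chi_\beta(x)}$

always takes a value whose magnitude is $O(\mathrm{poly}(\epsilon^{-1}))$. Using a
straightforward Chernoff bound argument, this implies that $|\theta - \theta'|$ can be made
smaller than any constant using $\mathrm{poly}(n, \log b, \epsilon^{-1})$ time and random examples.

Now observe that we have

$\mathrm{E}_{\mathcal{D}}[f\,\overline{e^{i\theta}\chi_\beta}] = e^{-i\theta}\,\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\beta}] = |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\beta}]| = c^{-1}|\widehat{f^*}(\beta)| \ge \gamma/2.$

Therefore for a sufficiently small value of $|\theta - \theta'|$, we have

$\mathrm{E}_{\mathcal{D}}[f\,\Re\{e^{i\theta'}\chi_\beta\}] = \Re\{\mathrm{E}_{\mathcal{D}}[f\,\overline{e^{i\theta'}\chi_\beta}]\} = \Re\{e^{i(\theta - \theta')}\,\mathrm{E}_{\mathcal{D}}[f\,\overline{e^{i\theta}\chi_\beta}]\} \ge \gamma/4,$

where the factor $\mathrm{E}_{\mathcal{D}}[f\,\overline{e^{i\theta}\chi_\beta}]$ is real valued and at least $\gamma/2$.

Since $\Re\{e^{i\theta'}\chi_\beta\}$ always takes values in $[-1, 1]$, we conclude that $\Re\{e^{i\theta'}\chi_\beta\}$
constitutes a weak hypothesis for $f$ with advantage $\gamma/4$ under $\mathcal{D}$ with high probability. □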
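
The phase-correction step at the end of the proof can be illustrated with a small exact computation. The sketch below takes $n = 1$ and computes expectations exactly over $[b]$ instead of by sampling, so the estimated phase coincides with $\theta$; the distribution, the target, the index $\beta = 1$, and all names are assumptions made for illustration only.

```python
import cmath

def chi(beta, x, b):
    """Character chi_beta(x) = omega_b^(beta * x) over [b]."""
    return cmath.exp(2j * cmath.pi * beta * x / b)

def weak_hypothesis(f, dist, beta, b):
    """Return g(x) = Re(e^{i theta} chi_beta(x)) with theta = arg E_D[f conj(chi_beta)]."""
    corr = sum(p * f(x) * chi(beta, x, b).conjugate() for x, p in dist.items())
    theta = cmath.phase(corr)
    g = lambda x: (cmath.exp(1j * theta) * chi(beta, x, b)).real
    return g, abs(corr)

b = 12
f = lambda x: -1 if 3 <= x <= 7 else 1              # target: a basic b-literal
dist = {x: (x + 1) / 78 for x in range(b)}          # a non-uniform distribution on [12]
g, magnitude = weak_hypothesis(f, dist, 1, b)

# With the exact phase, the advantage E_D[f g] equals |E_D[f conj(chi_beta)]|.
advantage = sum(p * f(x) * g(x) for x, p in dist.items())
```

In the lemma the phase is only approximated, which is why the guaranteed advantage degrades from $\gamma/2$ to $\gamma/4$.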



Rephrasing the statement of Lemma 3.2, we now know: as long as for any function
$f$ in the concept class it is guaranteed that under any smooth distribution $\mathcal{D}$
there is a Fourier basis element $\chi_\beta$ that has non-negligible correlation with $f$ (i.e.
$|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\beta}]| \ge \gamma$), then it is possible to efficiently identify and use such a Fourier
basis element to construct a weak hypothesis.

Now one can invoke Algorithm B from Theorem 2.7 as in Jackson's original Harmonic
Sieve: at stage $j$, we have a distribution $\mathcal{D}_j$ over $[b]^n$ for which $L_\infty(\mathcal{D}_j) \le
\mathrm{poly}(\epsilon^{-1})/b^n$. Thus one can pass the values of $\tilde{\mathcal{D}}_j$ to the algorithm in Lemma 3.2
and use this algorithm as WL in Algorithm B to obtain the weak hypothesis at each
stage. Repeating this idea for every stage and combining the weak hypotheses generated
for all the stages as described by Theorem 2.7, we have the GHS algorithm:



Corollary 3.3 (The Generalized Harmonic Sieve) Let $\mathfrak{C}$ be a concept class. Suppose
that for any concept $f \in C_{n,b}$ and any distribution $\mathcal{D}$ over $[b]^n$ with $L_\infty(\mathcal{D}) \le
\mathrm{poly}(\epsilon^{-1})/b^n$ there exists a Fourier basis element $\chi_\alpha$ such that $|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]| \ge \gamma$.
Then $\mathfrak{C}$ can be learned in time $\mathrm{poly}(n, \log b, \epsilon^{-1}, \gamma^{-1})$.



4 Learning Majority of Parity using GHS 



In this section we identify classes of functions which can be learned efficiently
using the GHS algorithm and prove Theorem 1.1.

Let $\mathfrak{C}$ denote the concept class of Theorem 1.1, the concept class of $s$-way Majority
of $r$-way Parity of $b$-literals where $s = \mathrm{poly}(n \log b)$, $r = O\left(\frac{\log(n \log b)}{\log\log(n \log b)}\right)$.

To prove Theorem 1.1, we show that for any concept $f \in \mathfrak{C}$ and under any smooth
distribution there must be some Fourier basis element which has high correlation
with $f$; this is the essential step which lets us apply the Generalized Harmonic
Sieve. We prove this in Section 4.2. In Section 4.3 we give an alternate argument
which yields a Theorem 1.1 analogue but with a slightly different bound on $r$,
namely $r = O\left(\frac{\log(n \log b)}{\log\log b}\right)$.



4.1 Setting the stage 



In this section we first focus our attention on functions defined over $[b]$, i.e. the case
$n = 1$.

For ease of notation we will write $\mathrm{abs}(\alpha)$ to denote $\min\{\alpha, b - \alpha\}$. We will use the
following simple lemma from [1]:

Lemma 4.1 (See [1]) For all $0 < \ell \le b$, we have $\left|\sum_{y=0}^{\ell-1} \omega_b^{\alpha y}\right| < b/\mathrm{abs}(\alpha)$.

Corollary 4.2 Let $f : [b] \to \{-1, 1\}$ be a basic $b$-literal. Then if $\alpha = 0$, $|\hat{f}(\alpha)| < 1$,
while if $\alpha \ne 0$, $|\hat{f}(\alpha)| < \frac{2}{\mathrm{abs}(\alpha)}$.
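PROOF. The first inequality follows immediately from Parseval's Identity given
in Lemma 2.4 because $f$ is $\{1, -1\}$-valued. For the latter, note that

$|\hat{f}(\alpha)| = |\mathrm{E}[f\overline{\chi_\alpha}]| = \frac{1}{b}\left|\sum_{x \in f^{-1}(1)} \overline{\chi_\alpha(x)} - \sum_{x \in f^{-1}(-1)} \overline{\chi_\alpha(x)}\right| \le \frac{1}{b}\left|\sum_{x \in f^{-1}(1)} \chi_\alpha(x)\right| + \frac{1}{b}\left|\sum_{x \in f^{-1}(-1)} \chi_\alpha(x)\right|$

where the inequality is simply the triangle inequality. It is easy to see that each of
the two terms on the RHS above equals $\frac{1}{b}|\omega_b^{\alpha c}|\left|\sum_{y=0}^{\ell-1} \omega_b^{\alpha y}\right| = \frac{1}{b}\left|\sum_{y=0}^{\ell-1} \omega_b^{\alpha y}\right|$ for some
suitable $c$ and $\ell \le b$, and hence Lemma 4.1 gives the desired result. □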
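
The decay bound of Corollary 4.2 is easy to check exhaustively for small $b$. The following sketch enumerates every interval $[\alpha_0, \beta_0] \subseteq [b]$, forms the corresponding basic $b$-literal, and compares each nonzero-index coefficient against $2/\mathrm{abs}(\alpha)$. It assumes the convention $\hat{f}(\alpha) = \mathrm{E}[f\overline{\chi_\alpha}]$; the helper names are illustrative.

```python
import cmath

def fourier_coefficient(f, alpha, b):
    """hat f(alpha) = (1/b) * sum_x f(x) * omega_b^(-alpha x)."""
    return sum(f(x) * cmath.exp(-2j * cmath.pi * alpha * x / b) for x in range(b)) / b

def abs_b(alpha, b):
    """abs(alpha) = min(alpha, b - alpha)."""
    return min(alpha, b - alpha)

def check_corollary(b):
    """True iff |hat f(alpha)| < 2/abs(alpha) for every basic b-literal and every alpha != 0."""
    for lo in range(b):
        for hi in range(lo, b):
            # The sign of the literal does not affect coefficient magnitudes.
            f = lambda x, lo=lo, hi=hi: -1 if lo <= x <= hi else 1
            for alpha in range(1, b):
                if abs(fourier_coefficient(f, alpha, b)) >= 2 / abs_b(alpha, b):
                    return False
    return True

holds_for_16 = check_corollary(16)
holds_for_15 = check_corollary(15)
```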



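The following easy lemma is useful for relating the Fourier transform of a $b$-literal
to the corresponding basic $b$-literal:

Lemma 4.3 For $f, g : [b] \to \mathbb{C}$ such that $g(x) = f(xz)$ where $\gcd(z, b) = 1$, we
have $\hat{g}(\alpha) = \hat{f}(\alpha z^{-1})$.



PROOF.

$\hat{g}(\alpha) = \mathrm{E}_x[g(x)\overline{\chi_\alpha(x)}] = \mathrm{E}_x[f(xz)\overline{\chi_\alpha(x)}] = \mathrm{E}_{xz^{-1}}[f(x)\overline{\chi_\alpha(xz^{-1})}]$
$= \mathrm{E}_{xz^{-1}}[f(x)\overline{\chi_{\alpha z^{-1}}(x)}] = \mathrm{E}_x[f(x)\overline{\chi_{\alpha z^{-1}}(x)}] = \hat{f}(\alpha z^{-1}).$ □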
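
Lemma 4.3 says that composing with multiplication by $z$ merely permutes the Fourier coefficients. A quick numerical check, with hypothetical parameters $b = 15$, $z = 2$ and an arbitrary basic $b$-literal (the three-argument `pow` with exponent `-1` needs Python 3.8 or later):

```python
import cmath

def fourier_coefficient(f, alpha, b):
    """hat f(alpha) = (1/b) * sum_x f(x) * omega_b^(-alpha x)."""
    return sum(f(x) * cmath.exp(-2j * cmath.pi * alpha * x / b) for x in range(b)) / b

b, z = 15, 2
z_inv = pow(z, -1, b)                          # modular inverse of z modulo b
f = lambda x: 1 if 3 <= x <= 9 else -1         # a basic b-literal
g = lambda x: f((x * z) % b)                   # the b-literal g(x) = f(xz)

# Largest deviation between hat g(alpha) and hat f(alpha * z^{-1}) over all alpha.
max_gap = max(
    abs(fourier_coefficient(g, a, b) - fourier_coefficient(f, (a * z_inv) % b, b))
    for a in range(b)
)
```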



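A natural way to approximate a $b$-literal is by truncating its Fourier representation.
We make the following definition:

Definition 4.4 Let $k$ be a positive integer. For $f : [b] \to \{-1, 1\}$ a basic $b$-literal,
the $k$-restriction of $f$ is $\tilde{f} : [b] \to \mathbb{C}$, $\tilde{f}(x) = \sum_{\mathrm{abs}(\alpha) \le k} \hat{f}(\alpha)\chi_\alpha(x)$. More generally,
for $f : [b] \to \{-1, 1\}$ a $b$-literal (so $f(x) = f'(xz)$ where $f'$ is a basic
$b$-literal) the $k$-restriction of $f$ is $\tilde{f} : [b] \to \mathbb{C}$,
$\tilde{f}(x) = \sum_{\mathrm{abs}(\alpha z^{-1}) \le k} \hat{f}(\alpha)\chi_\alpha(x) = \sum_{\mathrm{abs}(\beta) \le k} \hat{f'}(\beta)\chi_\beta(xz)$.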



4.2 There exist highly correlated Fourier basis elements for functions in $\mathfrak{C}$ under
smooth distributions



In this section we show that given any $f \in \mathfrak{C}$, the concept class of Theorem 1.1,
and any smooth distribution $\mathcal{D}$, some Fourier basis element must have high correlation
with $f$. In more detail, the main result of this section is the following theorem:

Theorem 4.5 Let $\tau \ge 1$ be any value, and let $\mathfrak{C}$ be the concept class consisting
of $s$-way Majority of $r$-way Parity of $b$-literals where $s = \mathrm{poly}(\tau)$ and $r =
O\left(\frac{\log \tau}{\log\log \tau}\right)$. Then for any $f \in C_{n,b}$ and any distribution $\mathcal{D}$ over $[b]^n$ with $L_\infty(\mathcal{D}) \le
\mathrm{poly}(\tau)/b^n$, there exists a Fourier basis element $\chi_\alpha$ such that

$|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]| = \Omega(1/\mathrm{poly}(\tau)).$



We prove the theorem after some preliminary lemmata about approximating basic
$b$-literals and products of basic $b$-literals. We begin by bounding the error of the
$k$-restriction of a basic $b$-literal:

Lemma 4.6 For $f : [b] \to \{-1, 1\}$ a $b$-literal and $\tilde{f}$ the $k$-restriction of $f$, we have
$\mathrm{E}[|f - \tilde{f}|^2] < 8/k$ and $\mathrm{E}[|f - \tilde{f}|] < \sqrt{8/k}$.



PROOF. Without loss of generality assume $f$ to be a basic $b$-literal. By an immediate
application of Lemma 2.4 (Parseval's Identity) we obtain:

$\mathrm{E}[|f - \tilde{f}|^2] = \sum_{\mathrm{abs}(\alpha) > k} |\hat{f}(\alpha)|^2 < 2 \cdot \sum_{m=k+1}^{\infty} \frac{4}{m^2} < 8\int_k^{\infty} \frac{dx}{x^2} = \frac{8}{k},$

where the first inequality is by Corollary 4.2.

By the non-negativity of variance, this implies $\mathrm{E}[|f - \tilde{f}|] < \sqrt{8/k}$. □
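
To see the $k$-restriction and the error bounds of Lemma 4.6 in action, the sketch below truncates the Fourier expansion of a basic $b$-literal over $[64]$ and measures both the mean squared error and the mean absolute error. The choice of $b$, the interval, the values of $k$, and all names are assumptions for illustration.

```python
import cmath
import math

def fourier_coefficient(f, alpha, b):
    """hat f(alpha) = (1/b) * sum_x f(x) * omega_b^(-alpha x)."""
    return sum(f(x) * cmath.exp(-2j * cmath.pi * alpha * x / b) for x in range(b)) / b

def k_restriction(f, k, b):
    """Keep only the coefficients with abs(alpha) = min(alpha, b - alpha) <= k."""
    coeffs = {a: fourier_coefficient(f, a, b) for a in range(b) if min(a, b - a) <= k}
    return lambda x: sum(c * cmath.exp(2j * cmath.pi * a * x / b) for a, c in coeffs.items())

def errors(f, k, b):
    """Return (E[|f - f~|^2], E[|f - f~|]) under the uniform distribution on [b]."""
    ft = k_restriction(f, k, b)
    diffs = [abs(f(x) - ft(x)) for x in range(b)]
    return sum(d * d for d in diffs) / b, sum(diffs) / b

b = 64
f = lambda x: -1 if 10 <= x <= 40 else 1
results = {k: errors(f, k, b) for k in (2, 8, 16)}
within_bounds = all(sq < 8 / k and l1 < math.sqrt(8 / k) for k, (sq, l1) in results.items())
```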



Now suppose that $f$ is an $r$-way Parity of $b$-literals $f_1, \ldots, f_r$. Since Parity corresponds
to multiplication over the domain $\{-1, 1\}$, this means that $f = \prod_{i=1}^{r} f_i$.
It is natural to approximate $f$ by the product of the $k$-restrictions $\prod_{i=1}^{r} \tilde{f}_i$. The following
lemma bounds the error of this approximation:

Lemma 4.7 For $i = 1, \ldots, r$, let $f_i : [b] \to \{-1, 1\}$ be a $b$-literal and let $\tilde{f}_i$ be its
$k$-restriction. Then

$\mathrm{E}[|f_1(x_1) f_2(x_2) \cdots f_r(x_r) - \tilde{f}_1(x_1)\tilde{f}_2(x_2) \cdots \tilde{f}_r(x_r)|] < e^{r\sqrt{8/k}} - 1.$

PROOF. First note that by Lemma 4.6 we have that for each $i = 1, \ldots, r$:

$\mathrm{E}_{x_i}[|f_i(x_i) - \tilde{f}_i(x_i)|] < \sqrt{8/k}.$

Therefore we also have for each $i = 1, \ldots, r$:

$\mathrm{E}_{x_i}[|\tilde{f}_i(x_i)|] \le \mathrm{E}_{x_i}[|\tilde{f}_i(x_i) - f_i(x_i)|] + \mathrm{E}_{x_i}[|f_i(x_i)|] < 1 + \sqrt{8/k},$

since the first term is less than $\sqrt{8/k}$ and the second equals 1.

For any $(x_1, \ldots, x_r)$ we can bound the difference in the lemma as follows:

$|f_1(x_1) \cdots f_r(x_r) - \tilde{f}_1(x_1) \cdots \tilde{f}_r(x_r)|$
$\le |f_1(x_1) \cdots f_r(x_r) - f_1(x_1) \cdots f_{r-1}(x_{r-1})\tilde{f}_r(x_r)| + |f_1(x_1) \cdots f_{r-1}(x_{r-1})\tilde{f}_r(x_r) - \tilde{f}_1(x_1) \cdots \tilde{f}_r(x_r)|$
$\le |f_r(x_r) - \tilde{f}_r(x_r)| + |\tilde{f}_r(x_r)|\,|f_1(x_1) \cdots f_{r-1}(x_{r-1}) - \tilde{f}_1(x_1) \cdots \tilde{f}_{r-1}(x_{r-1})|.$

Therefore the expectation in question is at most:

$\mathrm{E}_{x_r}[|f_r(x_r) - \tilde{f}_r(x_r)|] + \mathrm{E}_{x_r}[|\tilde{f}_r(x_r)|] \cdot \mathrm{E}_{(x_1, \ldots, x_{r-1})}[|f_1(x_1) \cdots f_{r-1}(x_{r-1}) - \tilde{f}_1(x_1) \cdots \tilde{f}_{r-1}(x_{r-1})|].$

We can repeat this argument successively until the base case

$\mathrm{E}_{x_1}[|f_1(x_1) - \tilde{f}_1(x_1)|] < \sqrt{8/k}$

is reached. Thus one obtains the upper bound

$\mathrm{E}[|f_1(x_1) \cdots f_r(x_r) - \tilde{f}_1(x_1) \cdots \tilde{f}_r(x_r)|] < \sqrt{8/k}\,\sum_{i=0}^{r-1} (1 + \sqrt{8/k})^i = (1 + \sqrt{8/k})^r - 1 < e^{r\sqrt{8/k}} - 1.$ □

Now we are ready to prove Theorem 4.5, which asserts the existence (under suitable
conditions) of a highly correlated Fourier basis element. The basic approach of the
following proof is reminiscent of the main technical lemma from [13].

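PROOF OF THEOREM 4.5. Assume $f$ is a Majority of $h_1, \ldots, h_s$, each of
which is an $r$-way Parity of $b$-literals. Then Lemma 2.8 implies that there exists $h_i$
such that $|\mathrm{E}_{\mathcal{D}}[f h_i]| \ge 1/s$. Let $h_i$ be Parity of the $b$-literals $\ell_1, \ldots, \ell_r$.

Since $s$ and $b^n \cdot L_\infty(\mathcal{D})$ are both at most $\mathrm{poly}(\tau)$ and $r = O\left(\frac{\log \tau}{\log\log \tau}\right)$, Lemma 4.7
implies that there are absolute constants $C_1, C_2$ such that if we consider the
$k$-restrictions $\tilde{\ell}_1, \ldots, \tilde{\ell}_r$ of $\ell_1, \ldots, \ell_r$ for $k = C_1 \tau^{C_2}$, we will have
$\mathrm{E}[|h_i - \prod_{j=1}^{r} \tilde{\ell}_j|] \le 1/(2 s b^n L_\infty(\mathcal{D}))$, where the expectation on the left hand side is with respect to the
uniform distribution on $[b]^n$. This in turn implies that $\mathrm{E}_{\mathcal{D}}[|h_i - \prod_{j=1}^{r} \tilde{\ell}_j|] \le 1/2s$.
Let us write $h'$ to denote $\prod_{j=1}^{r} \tilde{\ell}_j$. We then have

$|\mathrm{E}_{\mathcal{D}}[f\overline{h'}]| \ge |\mathrm{E}_{\mathcal{D}}[f\overline{h_i}]| - |\mathrm{E}_{\mathcal{D}}[f\overline{(h_i - h')}]| \ge |\mathrm{E}_{\mathcal{D}}[f\overline{h_i}]| - \mathrm{E}_{\mathcal{D}}[|f(h_i - h')|]$
$= |\mathrm{E}_{\mathcal{D}}[f\overline{h_i}]| - \mathrm{E}_{\mathcal{D}}[|h_i - h'|] \ge 1/s - 1/2s = 1/2s.$

By Observation 2.5 we additionally have

$|\mathrm{E}_{\mathcal{D}}[f\overline{h'}]| \le L_1(h') \max_\alpha |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]|.$

Moreover, for each $j = 1, \ldots, r$ we have the following (where we write $\ell'_j$ to denote
the basic $b$-literal associated with the $b$-literal $\ell_j$):

$L_1(\tilde{\ell}_j) = \sum_{\mathrm{abs}(\alpha) \le k} |\hat{\ell'_j}(\alpha)| < 1 + 2\sum_{m=1}^{k} \frac{2}{m} < 5 + 4\ln(k + 1),$

where the first inequality is by Corollary 4.2.

Therefore, for some absolute constant $c > 0$ we have $L_1(h') \le \prod_{j=1}^{r} L_1(\tilde{\ell}_j) \le
(c \log k)^r$, where the first inequality holds as a consequence of the elementary fact
that the $L_1$ norm of a product is at most the product of the $L_1$ norms of the components.
Combining inequalities, we obtain

$\max_\alpha |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]| \ge 1/(2s(c \log k)^r) = \Omega(1/\mathrm{poly}(\tau)),$

which is the desired result. □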



Since we are interested in algorithms with runtime $\mathrm{poly}(n, \log b, \epsilon^{-1})$, setting $\tau =
n\epsilon^{-1}\log b$ in Theorem 4.5 and combining its result with Corollary 3.3 gives rise to
Theorem 1.1.



4.3 The second approach 



A different analysis, similar to that which Jackson uses in the proof of [12, Fact
14], gives us an alternate bound to Theorem 4.5:

Lemma 4.8 Let $\mathfrak{C}$ be the concept class consisting of $s$-way Majority of $r$-way
Parity of $b$-literals. Then for any $f \in C_{n,b}$ and any distribution $\mathcal{D}$ over $[b]^n$, there
exists a Fourier basis element $\chi_\alpha$ such that $|\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]| = \Omega(1/s(\log b)^r)$.



PROOF. Assume $f$ is a Majority of $h_1, \ldots, h_s$, each of which is an $r$-way Parity
of $b$-literals. Then Lemma 2.8 implies that there exists $h_i$ such that $|\mathrm{E}_{\mathcal{D}}[f h_i]| \ge 1/s$.
Let $h_i$ be Parity of the $b$-literals $\ell_1, \ldots, \ell_r$. Observation 2.5 gives:

$1/s \le |\mathrm{E}_{\mathcal{D}}[f h_i]| = |\mathrm{E}_{\mathcal{D}}[f\overline{h_i}]| \le L_1(h_i) \max_\alpha |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]|.$

Also note that for $j = 1, \ldots, r$ we have the following (where as before we write $\ell'_j$
to denote the basic $b$-literal associated with the $b$-literal $\ell_j$):

$L_1(\ell_j) = \sum_\alpha |\hat{\ell'_j}(\alpha)| < 1 + 2 \cdot \sum_{m=1}^{b} \frac{2}{m} < 5 + 4\ln b,$

where the equality is by Lemma 4.3 and the first inequality is by Corollary 4.2.

Therefore for some constant $c > 0$ we have $L_1(h_i) \le \prod_{j=1}^{r} L_1(\ell_j) = O((\log b)^r)$,
from which we obtain $\max_\alpha |\mathrm{E}_{\mathcal{D}}[f\overline{\chi_\alpha}]| = \Omega(1/s(\log b)^r)$. □
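
The $L_1$ bound used in this proof can also be checked by brute force for a small $b$. The sketch below computes $L_1(\ell) = \sum_\alpha |\hat{\ell}(\alpha)|$ for every $b$-literal obtained from an interval and a multiplier $z$, and compares the largest value against $5 + 4\ln b$. The values $b = 24$ and $z \in \{1, 5\}$, along with all names, are assumptions chosen only to keep the enumeration fast.

```python
import cmath
import math
from math import gcd

def fourier_coefficient(f, alpha, b):
    """hat f(alpha) = (1/b) * sum_x f(x) * omega_b^(-alpha x)."""
    return sum(f(x) * cmath.exp(-2j * cmath.pi * alpha * x / b) for x in range(b)) / b

def l1_norm(f, b):
    """L_1(f) = sum over alpha of |hat f(alpha)|."""
    return sum(abs(fourier_coefficient(f, a, b)) for a in range(b))

def max_l1_over_literals(b, z):
    """Largest L_1 norm among b-literals x -> l'(x z mod b) with l' an interval literal."""
    assert gcd(z, b) == 1
    best = 0.0
    for lo in range(b):
        for hi in range(lo, b):
            lit = lambda x, lo=lo, hi=hi: -1 if lo <= (x * z) % b <= hi else 1
            best = max(best, l1_norm(lit, b))
    return best

b = 24
bound = 5 + 4 * math.log(b)
worst = {z: max_l1_over_literals(b, z) for z in (1, 5)}
```

By Lemma 4.3 the multiplier $z$ only permutes the coefficients, so the two maxima coincide up to rounding.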
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Combining this result with that of Corollary 13 .31 we obtain the following result: 

Theorem 4.9 The concept class consisting of s -way Majority of r -way Par- 
ity of b-literals can be learned in time poly(s,n, (log 6)'') using the GHS algo- 
rithm. 

As an immediate corollary we obtain the following close analogue of Theorem ll.il 

Theorem 4.10 The concept class $\mathcal{C}$ consisting of $s$-way MAJORITY of $r$-way PARITY
of $b$-literals, where $s = \mathrm{poly}(n\log b)$ and $r = O\left(\frac{\log(n\log b)}{\log\log b}\right)$, is efficiently learnable
using the GHS algorithm.
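The step from Theorem 4.9 to Theorem 4.10 is a one-line calculation: in the runtime poly$(s, n, (\log b)^r)$ the only factor that is not automatically polynomial in $n\log b$ is $(\log b)^r$. The sketch below (for $b > 2$, logarithms base 2, with $c$ an unspecified constant introduced here for illustration) shows that a bound of the form $r = O(\log(n\log b)/\log\log b)$ suffices:

```latex
r \le c \cdot \frac{\log(n \log b)}{\log\log b}
\quad\Longrightarrow\quad
(\log b)^{r} = 2^{\,r \log\log b} \le 2^{\,c \log(n \log b)} = (n \log b)^{c}.
```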



5 Locating sensitive elements and learning with GHS on a restricted grid 

In this section we consider an extension of the GHS algorithm which lets us achieve
slightly better bounds when we are dealing only with basic $b$-literals. Following
an idea from [3], the new algorithm works by identifying a subset of "sensitive"
elements from $[b]$ for each of the $n$ dimensions.

Definition 5.1 (See [3]) A value $\sigma \in [b]$ is called $i$-sensitive with respect to $f : [b]^n \to \{-1, 1\}$
if there exist values $c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n \in [b]$ such that

$$f(c_1, \ldots, c_{i-1}, (\sigma - 1) \bmod b, c_{i+1}, \ldots, c_n) \ne f(c_1, \ldots, c_{i-1}, \sigma, c_{i+1}, \ldots, c_n).$$

A value $\sigma$ is called sensitive with respect to $f$ if $\sigma$ is $i$-sensitive for some $i$. If there
is no $i$-sensitive value with respect to $f$, we say index $i$ is trivial.
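To make the definition concrete, here is a brute-force sketch that enumerates the $i$-sensitive values of a small function by checking every choice of the remaining coordinates. It is exponential in $n$ and meant only as an illustration; indices are 0-based in the code, and the example target `rectangle` is hypothetical.

```python
from itertools import product

def i_sensitive_values(f, b, n, i):
    """Return all sigma in [b] that are i-sensitive for f (i is 0-based)."""
    found = set()
    for sigma in range(b):
        prev = (sigma - 1) % b
        # Try every assignment to the other n - 1 coordinates.
        for rest in product(range(b), repeat=n - 1):
            x = rest[:i] + (sigma,) + rest[i:]
            y = rest[:i] + (prev,) + rest[i:]
            if f(x) != f(y):
                found.add(sigma)
                break
    return found

def rectangle(x):
    """Example target on [8]^2: +1 iff 2 <= x[0] <= 4 (x[1] is irrelevant)."""
    return 1 if 2 <= x[0] <= 4 else -1

if __name__ == "__main__":
    print(i_sensitive_values(rectangle, 8, 2, 0))  # boundaries of the interval
    print(i_sensitive_values(rectangle, 8, 2, 1))  # empty set: index is trivial
```

In this example the only sensitive values in the first coordinate are the two points where the interval starts and where it has just ended, and the second index is trivial.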

The main idea is to run GHS over a restricted subset of the original domain $[b]^n$,
namely the grid formed by the sensitive values and a few additional values,
and thereby lower the algorithm's complexity.

Definition 5.2 A grid in $[b]^n$ is a set $\mathcal{S} = L_1 \times L_2 \times \cdots \times L_n$ with $0 \in L_i \subseteq [b]$
for each $i$. We refer to the elements of $\mathcal{S}$ as corners. The region covered by a corner
$(x_1, \ldots, x_n) \in \mathcal{S}$ is defined to be the set $\{(y_1, \ldots, y_n) \in [b]^n : \forall i,\ x_i \le y_i < \lceil x_i \rceil\}$,
where $\lceil x_i \rceil$ denotes the smallest value in $L_i$ which is larger than $x_i$ (by convention
$\lceil x_i \rceil := b$ if no such value exists). The area covered by the corner $(x_1, \ldots, x_n) \in \mathcal{S}$
is therefore defined to be $\prod_{i=1}^{n} (\lceil x_i \rceil - x_i)$. A refinement of $\mathcal{S}$ is a grid in $[b]^n$ of the
form $L'_1 \times L'_2 \times \cdots \times L'_n$ where each $L_i \subseteq L'_i$.
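The following sketch illustrates regions and areas for a small grid. Each grid is represented as a list of Python sets (one per coordinate, each containing 0); this representation and the helper names are choices made here for illustration.

```python
from itertools import product
from math import prod

def next_in(L, x, b):
    """Smallest value in L larger than x, or b if no such value exists."""
    larger = [v for v in L if v > x]
    return min(larger) if larger else b

def corner_area(corner, grid, b):
    """Number of points of [b]^n in the region covered by `corner`."""
    return prod(next_in(L, x, b) - x for L, x in zip(grid, corner))

def all_areas(grid, b):
    """Map every corner of the grid to the area it covers."""
    return {c: corner_area(c, grid, b) for c in product(*grid)}

if __name__ == "__main__":
    b = 10
    grid = [{0, 3, 7}, {0}, {0, 5}]
    areas = all_areas(grid, b)
    print(areas[(3, 0, 5)])     # (7 - 3) * (10 - 0) * (10 - 5)
    print(sum(areas.values()))  # the regions partition [b]^n
```

Because every $L_i$ contains 0, the regions of the corners partition $[b]^n$, so the areas sum to $b^n$.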

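Lemma 5.3 Let $\mathcal{S}$ be a grid $L_1 \times L_2 \times \cdots \times L_n$ in $[b]^n$ such that each $|L_i| \le \ell$.
Let $I_{\mathcal{S}}$ denote the set of indices for which $L_i \ne \{0\}$. If $|I_{\mathcal{S}}| \le \kappa$, then $\mathcal{S}$ admits a
refinement $\mathcal{S}' = L'_1 \times L'_2 \times \cdots \times L'_n$ such that

(1) All of the sets $L'_i$ which contain more than one element have the same number
of elements: $L_{\max}$, which is at most $\ell + C\kappa\ell$, where $C = \frac{b}{\kappa\ell \lfloor b/(4\kappa\ell) \rfloor} \ge 4$.

(2) Given a list of the sets $L_1, \ldots, L_n$ as input, a list of the sets $L'_1, \ldots, L'_n$ can be
generated by an algorithm with a running time of $O(n\kappa\ell \log b)$.

(3) $L'_i = \{0\}$ whenever $L_i = \{0\}$.

(4) Any $\varepsilon$ fraction of the corners in $\mathcal{S}'$ cover a combined area of at most $2\varepsilon b^n$.

Algorithm 1 Computing a refinement of the grid $\mathcal{S}$ with the desired properties.

1: $L_{\max} \leftarrow 0$.
2: for all $1 \le i \le n$ do
3:   if $L_i = \{0\}$ then
4:     $L'_i \leftarrow \{0\}$.
5:   else
6:     Consider $L_i = \{x^i_0, x^i_1, \ldots, x^i_{t-1}\}$, where $x^i_0 < x^i_1 < \cdots < x^i_{t-1}$ (also let $x^i_t = b$).
7:     Set $L'_i \leftarrow L_i$ and $\tau \leftarrow \lfloor b/(4\kappa\ell) \rfloor$.
8:     for all $r = 0, \ldots, t-1$ do
9:       if $|x^i_{r+1} - x^i_r| > \tau$ then
10:        $L'_i \leftarrow L'_i \cup \{x^i_r + \tau, x^i_r + 2\tau, \ldots\}$ (up to and including the largest $x^i_r + j \cdot \tau$ which is less than $x^i_{r+1}$)
11:      end if
12:    end for
13:    if $|L'_i| > L_{\max}$ then
14:      $L_{\max} \leftarrow |L'_i|$.
15:    end if
16:  end if
17: end for
18: for all $1 \le i \le n$ with $|L'_i| > 1$ do
19:   while $|L'_i| < L_{\max}$ do
20:     $L'_i \leftarrow L'_i \cup \{$an arbitrary element from $[b]$ not already in $L'_i\}$.
21:   end while
22: end for
23: $\mathcal{S}' \leftarrow L'_1 \times L'_2 \times \cdots \times L'_n$.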



PROOF. Consider Algorithm 1 which, given $\mathcal{S} = L_1 \times L_2 \times \cdots \times L_n$, generates $\mathcal{S}'$.

The purpose of the code between lines 18-22 is to make every $L'_i \ne \{0\}$ contain
an equal number of elements. To this end the algorithm keeps track of the number
of elements in the largest $L'_i$ in a variable called $L_{\max}$ and eventually adds more
(arbitrary) elements to those $L'_i \ne \{0\}$ which have fewer elements.

It is clear that the algorithm satisfies Property 3 above.

Now consider the state of Algorithm 1 at line 18. Let $i$ be such that $|L'_i| = L_{\max}$.
Clearly $L'_i$ includes the elements in $L_i$, which are at most $\ell$ many. Moreover every
new element added to $L'_i$ in the loop spanning lines 8-12 covers a section of $[b]$ of
width $\tau$, and thus at most $b/\tau = C\kappa\ell$ elements can be added. Thus $L_{\max} \le \ell + C\kappa\ell$. At
the end of the algorithm every $L'_i$ contains either 1 element (in which case it is $\{0\}$) or $L_{\max}$
elements. This gives us Property 1. Note that $C \ge 4$ by construction.

It is easy to verify that the algorithm satisfies Property 2 as well (the $\log b$ factor in the runtime
is present because the algorithm works with $(\log b)$-bit integers).

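Property 1 and the bound $|I_{\mathcal{S}}| \le \kappa$ together give that the number of corners in $\mathcal{S}'$
is at most $(\ell + C\kappa\ell)^\kappa$. It is easy to see from the algorithm that the area covered by
each corner in $\mathcal{S}'$ is at most $b^n/(C\kappa\ell)^\kappa$ (again using the bound on $|I_{\mathcal{S}}|$). Therefore any $\varepsilon$
fraction of the corners in $\mathcal{S}'$ cover an area of at most:

$$\varepsilon(\ell + C\kappa\ell)^\kappa \times \frac{b^n}{(C\kappa\ell)^\kappa} = \varepsilon\left(1 + \frac{1}{C\kappa}\right)^{\kappa} \times b^n \le \varepsilon e^{1/4} b^n < 2\varepsilon b^n,$$

where the second-to-last step uses $C \ge 4$.

This gives Property 4. □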
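The refinement procedure and the properties just proved can be exercised on a small instance. The sketch below is one possible rendering of Algorithm 1, not a definitive implementation: grids are lists of Python sets, the "arbitrary" padding element is taken to be the smallest unused value (a choice made here), and `max_gap` is a helper introduced only to check the width bound used for Property 4.

```python
def refine(grid, b, kappa, ell):
    """Sketch of the refinement step; returns (refined grid, L_max).

    Assumes 0 is in every set, each set has at most `ell` elements, at most
    `kappa` sets differ from {0}, and b >= 4 * kappa * ell so that tau >= 1.
    """
    tau = b // (4 * kappa * ell)
    if tau < 1:
        raise ValueError("b too small relative to kappa * ell")
    refined = []
    for L in grid:
        new = set(L)
        if L != {0}:
            xs = sorted(L) + [b]
            for lo, hi in zip(xs, xs[1:]):
                if hi - lo > tau:
                    # add lo + tau, lo + 2*tau, ... strictly below hi
                    new.update(range(lo + tau, hi, tau))
        refined.append(new)
    l_max = max((len(L) for L in refined if len(L) > 1), default=1)
    for L in refined:
        if len(L) > 1:
            spare = (v for v in range(b) if v not in L)
            while len(L) < l_max:
                L.add(next(spare))  # padding: smallest value not yet used
    return refined, l_max

def max_gap(L, b):
    """Largest width of an interval between consecutive grid values (or b)."""
    xs = sorted(L) + [b]
    return max(hi - lo for lo, hi in zip(xs, xs[1:]))

if __name__ == "__main__":
    refined, l_max = refine([{0, 17, 500}, {0}, {0, 990}, {0, 3}], 1200, 3, 3)
    print(l_max, [len(L) for L in refined])
```

After refinement every non-trivial coordinate has gaps of width at most $\tau$, which is what bounds the area of any single corner in the proof of Property 4.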



The following lemma is easy and useful; similar statements are given in [3]. Note
that the lemma critically relies on the $b$-literals being basic.

Lemma 5.4 Let $f : [b]^n \to \{-1, 1\}$ be expressed as an $s$-way MAJORITY of PARITY
of basic $b$-literals. Then for each index $1 \le i \le n$, there are at most $2s$
$i$-sensitive values with respect to $f$.



PROOF. A literal $\ell$ on variable $x_i$ induces two $i$-sensitive values. The lemma
follows directly from our assumption (see Section 2) that for each variable $x_i$, each
of the $s$ PARITY gates has no more than one incoming literal which depends on
$x_i$. □



Algorithm 2 is our extension of the GHS algorithm. It essentially works by repeatedly
running GHS on the target function $f$ but restricted to a small (relative to $[b]^n$)
grid. To upper bound the number of steps in each of these invocations we will
refer to the result of Theorem 4.10. After each execution of GHS, the hypothesis
defined over the grid is extended to $[b]^n$ in a natural way and is tested for $\varepsilon$-accuracy.
If $h$ is not $\varepsilon$-accurate, then a point where $h$ is incorrect is used to identify a new
sensitive value and this value is used to refine the grid for the next iteration. The bound
on the number of sensitive values from Lemma 5.4 lets us bound the number of
iterations. Our theorem about Algorithm 2's performance is the following:
Algorithm 2 An improved algorithm for learning MAJORITY of PARITY of basic
$b$-literals.

1: $L_1 \leftarrow \{0\}, L_2 \leftarrow \{0\}, \ldots, L_n \leftarrow \{0\}$.
2: loop
3:   $\mathcal{S} \leftarrow L_1 \times L_2 \times \cdots \times L_n$.
4:   $\mathcal{S}' \leftarrow$ the output of the refinement algorithm with input $\mathcal{S}$.
5:   One can express $\mathcal{S}' = L'_1 \times L'_2 \times \cdots \times L'_n$. If $L'_i \ne \{0\}$ then $L'_i = \{x^i_0, x^i_1, \ldots, x^i_{L_{\max}-1}\}$. Let $x^i_0 < x^i_1 < \cdots < x^i_{L_{\max}-1}$ and let $\tau_i : \mathbb{Z}_{L_{\max}} \to L'_i$ be the translation function such that $\tau_i(j) = x^i_j$. If $L_i = L'_i = \{0\}$ then $\tau_i$ is the function simply mapping 0 to 0.
6:   Invoke GHS over $f|_{\mathcal{S}'}$ with accuracy $\varepsilon/8$. This is done by simulating MEM$(f|_{\mathcal{S}'}(x_1, \ldots, x_n))$ with MEM$(f(\tau_1(x_1), \tau_2(x_2), \ldots, \tau_n(x_n)))$. Let the output of the algorithm be $g$.
7:   Let $h$ be a hypothesis function over $[b]^n$ such that $h(x_1, \ldots, x_n) = g(\tau_1^{-1}(\lfloor x_1 \rfloor), \ldots, \tau_n^{-1}(\lfloor x_n \rfloor))$ ($\lfloor x_i \rfloor$ denotes the largest value in $L'_i$ less than or equal to $x_i$).
8:   if $h$ $\varepsilon$-approximates $f$ then
9:     Output $h$ and terminate.
10:  end if
11:  Perform random membership queries until an element $(x_1, \ldots, x_n) \in [b]^n$ is found such that $f(\lfloor x_1 \rfloor, \ldots, \lfloor x_n \rfloor) \ne f(x_1, \ldots, x_n)$.
12:  Find an index $1 \le i \le n$ such that $f(\lfloor x_1 \rfloor, \ldots, \lfloor x_{i-1} \rfloor, x_i, \ldots, x_n) \ne f(\lfloor x_1 \rfloor, \ldots, \lfloor x_{i-1} \rfloor, \lfloor x_i \rfloor, x_{i+1}, \ldots, x_n)$. This requires $O(\log n)$ membership queries using binary search.
13:  Find a value $\sigma$ such that $\lfloor x_i \rfloor + 1 \le \sigma \le x_i$ and $f(\lfloor x_1 \rfloor, \ldots, \lfloor x_{i-1} \rfloor, \sigma - 1, x_{i+1}, \ldots, x_n) \ne f(\lfloor x_1 \rfloor, \ldots, \lfloor x_{i-1} \rfloor, \sigma, x_{i+1}, \ldots, x_n)$. This requires $O(\log b)$ membership queries using binary search.
14:  $L_i \leftarrow L_i \cup \{\sigma\}$.
15: end loop
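Steps 5 to 7 amount to a change of coordinates between the grid and the full domain. The sketch below shows one way to realize the translation functions, the simulated membership oracle, and the extension of a grid hypothesis to $[b]^n$. The list-of-sorted-values representation and all function names are illustrative assumptions, and the GHS call itself is not modeled.

```python
from bisect import bisect_right

def make_translations(refined):
    """tau_i as a sorted list: index j maps to the j-th smallest value of L'_i."""
    return [sorted(L) for L in refined]

def restrict_to_grid(f, taus):
    """Membership oracle for f restricted to the grid (step 6):
    a grid point j = (j_1, ..., j_n) is answered by f(tau_1(j_1), ..., tau_n(j_n))."""
    return lambda j: f(tuple(t[k] for t, k in zip(taus, j)))

def extend_to_domain(g, taus):
    """Hypothesis h(x) = g(tau^{-1}(floor(x))) over [b]^n (step 7)."""
    def h(x):
        # bisect_right(t, v) - 1 is the index of the largest grid value <= v;
        # it is never negative because every L'_i contains 0.
        return g(tuple(bisect_right(t, v) - 1 for t, v in zip(taus, x)))
    return h

if __name__ == "__main__":
    f = lambda x: 1 if 3 <= x[0] <= 6 else -1      # example target on [10]^2
    taus = make_translations([{0, 3, 7}, {0}])     # grid contains both sensitive values
    h = extend_to_domain(restrict_to_grid(f, taus), taus)
    print(all(h((u, v)) == f((u, v)) for u in range(10) for v in range(10)))
```

When the grid contains every sensitive value, a hypothesis that is correct on the grid extends to one that is correct everywhere, which is the observation the termination argument relies on.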

Theorem 5.5 Let concept class $\mathcal{C}$ consist of $s$-way MAJORITY of $r$-way PARITY of
basic $b$-literals such that $s = \mathrm{poly}(n\log b)$ and each $f \in \mathcal{C}_{n,b}$ has at most $\kappa(n, b)$
non-trivial indices and at most $\ell(n, b)$ $i$-sensitive values for each $i = 1, \ldots, n$. Then
$\mathcal{C}$ is efficiently learnable if $r = O\left(\frac{\log(n\log b)}{\log\log(\kappa\ell)}\right)$.

PROOF. We assume $b = \omega(\kappa\ell)$ without loss of generality. Otherwise one
immediately obtains the result with a direct application of GHS through Theorem 4.10.

We clearly have $\kappa \le n$ and $\ell \le 2s$. By Lemma 5.4 there are at most $\kappa\ell = O(ns)$
sensitive values. We will show that the algorithm finds a new sensitive value at each
iteration and terminates at the latest once all sensitive values have been found. Therefore the number
of iterations will be upper bounded by $O(ns)$. We will also show that each iteration
runs in poly$(n, \log b, \varepsilon^{-1})$ steps. This will establish the desired result.

Let us first establish that step 6 takes at most poly$(n, \log b, \varepsilon^{-1})$ steps. To observe
this it is sufficient to combine the following facts:

• Due to the construction of Algorithm 1, for every non-trivial index $i$ of $f$, $L'_i$ has
fixed cardinality $L_{\max}$. Therefore GHS can be invoked over the restriction
of $f$ onto the grid, $f|_{\mathcal{S}'}$, without any trouble.

• If $f$ is an $s$-way MAJORITY of $r$-way PARITY of basic $b$-literals, then the function
obtained by restricting it onto the grid, $f|_{\mathcal{S}'}$, can be expressed as a $t$-way
MAJORITY of $u$-way PARITY of basic $L$-literals where $t \le s$, $u \le r$ and $L \le O(\kappa\ell)$
(due to the 1st property of the refinement).

• Due to Theorem 4.10, running GHS over a grid with alphabet size $O(\kappa\ell)$ in each
non-trivial index takes poly$(n, \log b, \varepsilon^{-1})$ time if the dimension of the rectangles
is $r = O\left(\frac{\log(n\log b)}{\log\log(\kappa\ell)}\right)$. The key idea here is that running GHS over this $\kappa\ell$-size
alphabet lets us replace the "$b$" in Theorem 4.10 with "$\kappa\ell$".

To check whether $h$ $\varepsilon$-approximates $f$ at step 8, we may draw $O(1/\varepsilon) \cdot \log(1/\delta)$
uniform random examples and use the membership oracle to empirically estimate
$h$'s accuracy on these examples. Standard bounds on sampling show that if the
true error rate of $h$ is less than (say) $\varepsilon/2$, then the empirical error rate on such a
sample will be less than $\varepsilon$ with probability $1 - \delta$. Observe that if all the sensitive
values are recovered by the algorithm, $h$ will $\varepsilon$-approximate $f$ with high probability.
Indeed, since $g$ $(\varepsilon/8)$-approximates $f|_{\mathcal{S}'}$, Property 4 of the refinement guarantees
that misclassifying the function at an $\varepsilon/8$ fraction of the corners can incur
an overall error of at most $2\varepsilon/8 = \varepsilon/4$. This is because when all the sensitive elements are
recovered, for every corner in $\mathcal{S}'$, $h$ either agrees with $f$ or disagrees with $f$ in the
entire region covered by that corner. Thus $h$ will be an $\varepsilon/4$ approximator to $f$ with
high probability. This establishes that the algorithm must terminate within $O(ns)$
iterations of the outer loop.

Locating another sensitive value occurs at steps 11, 12 and 13. Note that $h$ is not an
$\varepsilon$-approximator to $f$ because the algorithm moved beyond step 8. Even if we were
to correct all the mistakes in $g$, this would alter at most an $\varepsilon/8$ fraction of the corners
in the grid $\mathcal{S}'$ and therefore at most an $\varepsilon/4$ fraction of the values of $h$, again due to the 4th
property of the refinement and the way $h$ is generated. Therefore for at least a $3\varepsilon/4$
fraction of the domain we must have $f(\lfloor x_1 \rfloor, \ldots, \lfloor x_n \rfloor) \ne f(x_1, \ldots, x_n)$, where
$\lfloor x_i \rfloor$ denotes the largest value in $L'_i$ less than or equal to $x_i$. Thus the algorithm requires
at most $O(1/\varepsilon)$ random queries to find such an input in step 11.
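The two binary searches in steps 12 and 13 can be sketched as follows. Both maintain the invariant that the function value at the left end of the current interval equals the starting value while the value at the right end differs. The function names and the tuple-based point representation are illustrative, and indices are 0-based.

```python
def find_index(f, x, floor_x):
    """Step 12: binary search over the hybrids z_k = floor_x[:k] + x[k:].

    Requires f(x) != f(floor_x). Returns a 0-based index i such that replacing
    coordinate i of z_i by its floor changes the value of f.
    """
    hybrid = lambda k: tuple(floor_x[:k]) + tuple(x[k:])
    target = f(hybrid(0))
    lo, hi = 0, len(x)          # f(hybrid(lo)) == target != f(hybrid(hi))
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if f(hybrid(mid)) == target:
            lo = mid
        else:
            hi = mid
    return lo

def find_sigma(g, a, c):
    """Step 13: given a < c with g(a) != g(c), return sigma in (a, c]
    with g(sigma - 1) != g(sigma), using O(log(c - a)) evaluations of g."""
    ga = g(a)
    lo, hi = a, c               # g(lo) == ga != g(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if g(mid) == ga:
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    rect = lambda x: 1 if 3 <= x[0] <= 9 and 5 <= x[1] <= 12 else -1
    x, floor_x = (5, 7), (0, 0)                 # f(x) = 1 but f(floor_x) = -1
    i = find_index(rect, x, floor_x)
    sigma = find_sigma(lambda v: rect((v,) + x[1:]), floor_x[i], x[i])
    print(i, sigma)
```

In the example the search isolates the first coordinate and returns the left boundary of the rectangle in that coordinate, which is a new sensitive value to add to $L_i$.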

Thus we have observed that steps 6, 8, 11, 12, 13 take at most poly$(n, \log b, \varepsilon^{-1})$
steps. Therefore each iteration of Algorithm 2 runs in poly$(n, \log b, \varepsilon^{-1})$ steps as
claimed.

We note that we have been somewhat cavalier in our treatment of the failure
probabilities for various events. These include the possibility of getting an inaccurate
estimate of $h$'s error rate in step 8, or not finding a suitable element $(x_1, \ldots, x_n)$
soon enough in step 11, or having the GHS algorithm fail to return a good
hypothesis in one of its executions. A standard analysis shows that all these failure
probabilities can be made suitably small so that the overall failure probability is at
most $\delta$ within the claimed runtime. □



6 Applications to learning unions of rectangles 

In this section we apply the results we have obtained in Sections 4 and 5 to obtain
results on learning unions of rectangles and related classes.



6.1 Learning majorities and unions of many low-dimensional rectangles 

The following lemma will let us apply our algorithm for learning MAJORITY of
PARITY of $b$-literals to learn MAJORITY of AND of $b$-literals:

Lemma 6.1 Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be expressible as an $s$-way MAJORITY
of $r$-way AND of Boolean literals. Then $f$ is also expressible as an $O(ns^2)$-way
MAJORITY of $r$-way PARITY of Boolean literals.

We note that Krause and Pudlak also gave a related but slightly weaker bound in
[16]; they used a probabilistic argument to show that any $s$-way MAJORITY of AND
of Boolean literals can be expressed as an $O(n^2 s^4)$-way MAJORITY of PARITY.
Our boosting-based argument below closely follows that of [12, Corollary 13].



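PROOF OF LEMMA 6.1. Let $f$ be the MAJORITY of $h_1, \ldots, h_s$ where each $h_i$
is an AND gate of fan-in $r$. By Lemma 2.8, given any distribution $\mathcal{D}$ there is some
AND function $h_j$ such that $|E_{\mathcal{D}}[f h_j]| \ge 1/s$. Moreover the $L_1$-norm of any AND
function is at most 3. To see this observe that one can express AND as follows:

$$\mathrm{AND}_r(x_1, \ldots, x_r) = 2\left(\prod_{i=1}^{r} \frac{1 - x_i}{2}\right) - 1 = 2\left(\sum_{S \subseteq \{1,\ldots,r\}} \frac{(-1)^{|S|}}{2^r}\chi_S(x)\right) - 1 = \left(\frac{2}{2^r} - 1\right) + \sum_{|S| \ge 1} \frac{2(-1)^{|S|}}{2^r}\chi_S(x).$$

Consequently $L_1(\mathrm{AND}_r) \le 1 + 2^r \cdot \frac{2}{2^r} = 3$ and thus we have $L_1(h_j) \le 3$.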
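The bound $L_1(\mathrm{AND}_r) \le 3$ is easy to confirm by brute force for small $r$. The sketch below computes every Fourier coefficient of AND over $\{-1, 1\}^r$ exactly, using rational arithmetic. It assumes that $-1$ encodes "true" for both inputs and output; the opposite convention only changes signs of coefficients, not their absolute values.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def and_gate(x):
    """AND over {-1, 1}^r, with -1 encoding 'true' (an assumed convention)."""
    return -1 if all(v == -1 for v in x) else 1

def fourier_l1(f, r):
    """Exact sum over S of |E_x[f(x) * prod_{i in S} x_i]|, x uniform on {-1,1}^r."""
    points = list(product((-1, 1), repeat=r))
    total = Fraction(0)
    for size in range(r + 1):
        for S in combinations(range(r), size):
            coeff = Fraction(sum(f(x) * prod(x[i] for i in S) for x in points),
                             len(points))
            total += abs(coeff)
    return total

if __name__ == "__main__":
    for r in range(1, 7):
        print(r, fourier_l1(and_gate, r))
```

The computed norm equals $3 - 4/2^r$ for each $r$ tried, so it approaches 3 from below without reaching it.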
Now Observation 2.5 implies that there must be some parity function $\chi_\alpha$ such that
$|E_{\mathcal{D}}[f\chi_\alpha]| \ge 1/(3s)$, where the variables in $\chi_\alpha$ are a subset of the variables in $h_j$, and
thus $\chi_\alpha$ is a parity of at most $r$ literals. As in the proof of [12, Corollary 13], we can
now apply the boosting algorithm of [7]; this algorithm runs for $O(\log(1/\varepsilon)/\gamma^2)$
stages to construct an $\varepsilon$-accurate final hypothesis if it is given a weak hypothesis
with advantage $\gamma$ at each stage. We choose the weak hypothesis to be a PARITY
with fan-in at most $r$ at each stage of boosting, and the above arguments ensure
that each weak hypothesis has advantage $\Omega(1/s)$ at every stage of boosting.
If we boost to accuracy $\varepsilon = 1/2^{n+1}$, then the resulting final hypothesis will have zero
error with respect to $f$ and will be a MAJORITY of $O(\log(1/\varepsilon) \cdot s^2) = O(ns^2)$ many
$r$-way PARITY functions. □

Note that while this argument does not lead to a computationally efficient
construction of the desired MAJORITY of $r$-way PARITY, it does establish its existence,
which is all we need.

Also note that any union (OR) of $s$ many $r$-dimensional rectangles can be expressed
as an $O(s)$-way MAJORITY of $r$-dimensional rectangles as well.
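The text does not spell out how an OR becomes a MAJORITY. One standard padding construction, assumed here for illustration, adds $s - 1$ constant-true inputs (a rectangle with no constraints) so that a single satisfied rectangle already tips the vote. The sketch checks this exhaustively for small $s$, with $+1$ encoding "true".

```python
from itertools import product

def majority(votes):
    """+1 iff the +/-1 votes sum to a positive number (odd fan-in assumed)."""
    return 1 if sum(votes) > 0 else -1

def or_via_majority(bits):
    """OR of s bits (+1 = true) as a (2s - 1)-way MAJORITY:
    the s original inputs plus s - 1 constant-true inputs."""
    s = len(bits)
    return majority(list(bits) + [1] * (s - 1))

if __name__ == "__main__":
    for s in range(1, 6):
        ok = all(or_via_majority(bits) == (1 if 1 in bits else -1)
                 for bits in product((-1, 1), repeat=s))
        print(s, ok)
```

With $k$ satisfied inputs the vote total is $2k - 1$, which is positive exactly when $k \ge 1$; the fan-in $2s - 1$ is $O(s)$ as claimed.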

Theorem 1.1 and Lemma 6.1 together give us Theorem 1.2. (In fact, these results
give us learnability of $s$-way MAJORITY of $r$-way AND of $b$-literals which need
not necessarily be basic.)

6.2 Learning unions of fewer rectangles of higher dimension 

We now show that the number of rectangles $s$ and the dimension bound $r$ of each
rectangle can be traded off against each other in Theorem 1.2 to a limited extent.
We state the results below for the case $s = \mathrm{poly}(\log(n\log b))$, but one could obtain
analogous results for a range of different choices of $s$.

We require the following lemma: 

Lemma 6.2 Any $s$-term $r$-DNF can be expressed as an $r^{O(\sqrt{r}\log s)}$-way MAJORITY
of $O(\sqrt{r}\log s)$-way PARITY of Boolean literals.

PROOF. [15, Corollary 13] states that any $s$-term $r$-DNF can be expressed as an
$r^{O(\sqrt{r}\log s)}$-way MAJORITY of $O(\sqrt{r}\log s)$-way ANDs. Now recall that the Fourier
representation of an AND of $t$ variables is a linear combination of $2^t$ PARITYs
(or negated PARITYs), each with a coefficient of $1/2^t$ (this Fourier representation is
given explicitly in the proof of Lemma 6.1). Clearing this common denominator, we
may simply replace each AND that is input to the MAJORITY with the corresponding
sum of $2^t$ PARITYs (or negated PARITYs). This gives the lemma. □
Now we can prove Theorem 1.3, which gives us roughly a quadratic improvement
in the dimension $r$ of rectangles over Theorem 1.2 when $s = \mathrm{poly}(\log(n\log b))$.



PROOF OF THEOREM 1.3. First note that by Lemma 5.4 any function in $\mathcal{C}_{n,b}$
(as defined in Section 2.1) can have at most $\kappa = O(rs) = \mathrm{poly}(\log(n\log b))$
non-trivial indices, and at most $\ell = O(s) = \mathrm{poly}(\log(n\log b))$ many $i$-sensitive values
for all $i = 1, \ldots, n$. Now use Lemma 6.2 to express any function in $\mathcal{C}_{n,b}$ as an
$s'$-way MAJORITY of $r'$-way PARITY of basic $b$-literals where $s' = r^{O(\sqrt{r}\log s)} =
\mathrm{poly}(n\log b)$ and $r' = O(\sqrt{r}\log s) = O\left(\frac{\log(n\log b)}{\log\log\log(n\log b)}\right)$. Finally, apply
Theorem 5.5 to obtain the desired result. □
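The bound on $r'$ can be checked directly. The sketch below writes $N = n\log b$ (a shorthand introduced here) and assumes the dimension bound stated in the abstract, $r = O(\log^2 N/(\log\log N \cdot \log\log\log N)^2)$, together with $s = \mathrm{poly}(\log N)$ and $\kappa\ell = \mathrm{poly}(\log N)$:

```latex
\log s = O(\log\log N), \qquad
\sqrt{r} = O\!\left(\frac{\log N}{\log\log N \cdot \log\log\log N}\right)
\;\Longrightarrow\;
r' = O(\sqrt{r}\,\log s) = O\!\left(\frac{\log N}{\log\log\log N}\right),
\\[4pt]
\log\log(\kappa\ell) = O(\log\log\log N)
\;\Longrightarrow\;
r' = O\!\left(\frac{\log N}{\log\log(\kappa\ell)}\right).
```

The last expression is the form of dimension bound under which the restricted-grid algorithm of Section 5 applies.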

Note that it is possible to obtain a similar result for learning a poly$(\log(n\log b))$-way
union of $O\left(\frac{\log^2(n\log b)}{(\log\log(n\log b))^4}\right)$-way AND of $b$-literals if one were to invoke
Theorem 1.1.



6.3 Learning majorities of unions of disjoint rectangles 

A set $\{R_1, \ldots, R_s\}$ of rectangles is said to be disjoint if every input $x \in [b]^n$
satisfies at most one of the rectangles. Learning unions of disjoint rectangles over $[b]^n$
has been studied in prior work, and is a natural analogue over $[b]^n$ of learning "disjoint DNF",
which has been well studied in the Boolean domain (see e.g. [14, 2]).

We observe that when disjoint rectangles are considered, Theorem 1.2 extends to
the concept class of majority of unions of disjoint rectangles. This extension relies
on the easily verified fact that if $f_1, \ldots, f_t$ are functions from $[b]^n$ to $\{-1, 1\}$
such that each $x$ satisfies at most one $f_i$, then the function $\mathrm{OR}(f_1, \ldots, f_t)$ satisfies
$L_1(\mathrm{OR}(f_1, \ldots, f_t)) = O(L_1(f_1) + \cdots + L_1(f_t))$. This fact lets us apply the argument
behind Theorem 4.5 without modification, and we obtain Corollary 1.4. Note that
only the rectangles connected to the same OR gate must be disjoint in order to
invoke Corollary 1.4.
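The fact about $L_1(\mathrm{OR}(f_1, \ldots, f_t))$ can be verified in a few lines. The sketch below assumes that $+1$ encodes "satisfied" (the opposite convention only flips signs) and uses Parseval's identity under the uniform distribution; it is one way to see the claim, not necessarily the argument the authors had in mind.

```latex
\mathrm{OR}(f_1,\dots,f_t) \;=\; 2\sum_{i=1}^{t}\frac{1+f_i}{2} \;-\; 1
\;=\; \sum_{i=1}^{t} f_i \;+\; (t-1)
\qquad \text{(at most one $f_i$ equals $+1$ at any point)},
\\[4pt]
L_1\big(\mathrm{OR}(f_1,\dots,f_t)\big) \;\le\; \sum_{i=1}^{t} L_1(f_i) + (t-1)
\;\le\; 2\sum_{i=1}^{t} L_1(f_i),
\\[4pt]
\text{since}\quad L_1(f_i) = \sum_{\alpha} |\hat f_i(\alpha)| \;\ge\; \sum_{\alpha} |\hat f_i(\alpha)|^2 = E[f_i^2] = 1 .
```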



7 Conclusions and future work 

For future work, besides the obvious goals of strengthening our positive results, we
feel that it would be interesting to explore the limitations of current techniques for
learning unions of rectangles over $[b]^n$. At this point we cannot rule out the
possibility that the Generalized Harmonic Sieve algorithm is in fact a poly$(n, s, \log b)$-time
algorithm for learning unions of $s$ arbitrary rectangles over $[b]^n$. Can evidence for
or against this possibility be given? For example, can one show that the
representational power of the hypotheses which the Generalized Harmonic Sieve algorithm
produces (when run for poly$(n, s, \log b)$ many stages) is, or is not, sufficient to
express high-accuracy approximators to arbitrary unions of $s$ rectangles over $[b]^n$?